Honestly, IMHO he's missing the actual point; he's not truly describing what's going on.
If my processor had an instruction that sums 512 values from contiguous memory, it would have the same characteristics, and nobody would call it a tensor core. That is, the "processor" knows it has to burst-prefetch a specific chunk of memory, and afterwards it can do a hierarchical parallel reduction, which, if the ALUs are there, takes just log2(512) = 9 cycles. These are single large instructions, mostly with interdependent immediate result-sharing. Almost every operation defined like this would run faster, not because less internal bandwidth is used, but because the situation is forced to be favorable for the task. If you count the bits going over the busses between the units, it's the same amount. The unit itself is a value buffer; it is itself a register, albeit an anonymous one. If you take away a GPR and add a unit, you don't change the number of state holders.

Take away the forced favorability and each advantage evaporates: if the vector (or matrix) is allowed to be scattered, the memory-burst advantage goes away; if the instruction sequence is discretized into individual instructions, the instruction "compression" advantage goes away; and if the unit is only one instruction wide, the parallelism advantage goes away.

There are a myriad of examples of this. One is the texture filtering block, where the situation is forced to be favorable for the parallel reduction (locality, Morton order, cascades, ...). Same with rasterizer blocks: they execute virtual (or on some architectures real) instructions on groups of pixels/tiles, which again is favorable for tile-internal depth reduction, tile-fetch bursts, and so on.
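To make the reduction concrete, here's a minimal sketch in C of what such a hypothetical sum-512 instruction would do internally: a pairwise tree reduction with log2(512) = 9 levels. In software the levels are loop iterations; in hardware, with 256 adders present, each level would be one cycle.

    #include <stdio.h>

    #define N 512  /* must be a power of two for this sketch */

    /* Hierarchical (tree) reduction: log2(N) levels. In hardware, with N/2
     * adders, each iteration of the outer loop would complete in one cycle,
     * so summing 512 values takes 9 cycles instead of 511 dependent adds.
     * (Destroys its input; it's a sketch, not a library function.) */
    static float tree_sum(float v[N])
    {
        for (int stride = N / 2; stride > 0; stride /= 2)  /* 9 levels for N=512 */
            for (int i = 0; i < stride; i++)               /* all adds within a level are independent */
                v[i] += v[i + stride];
        return v[0];
    }

    int main(void)
    {
        float v[N];
        for (int i = 0; i < N; i++)
            v[i] = 1.0f;                  /* sum should come out as 512 */
        printf("%f\n", tree_sum(v));
        return 0;
    }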
The question is which instructions should be designed/offered. There are just too many possible candidates, and maybe we should turn to FPGAs for these things; there are already small programmable units in today's processors. The different instructions one would use tend to occur clustered together in phases (that is, the code stream is not stationary), so reprogramming on the fly isn't prohibitive. Holistically, this performs better than having a small selected set of fast instructions, because you now have an effectively infinite set of not-so-fast instructions, even if the FPGA is clocked lower. Say you program all the filter kernels used in a compute/graphics pipeline into the FPGA (DoF, other blurs, soft shadows, and so on), one stage after the other: even though the block is clocked slower, the net result is more performance.
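As a software analogy of the program-a-kernel-per-stage idea (all names here are hypothetical, not any real FPGA API): one generic convolution engine stands in for the reconfigurable block, and each pipeline stage loads its own coefficients into it before running. Amortized over a whole stage, the reload cost is negligible, which is exactly the point about the code stream not being stationary.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the reconfigurable block: one generic 1D
     * convolution engine whose "configuration" is the kernel loaded into it. */
    typedef struct {
        const float *kernel;  /* currently loaded coefficients */
        size_t taps;
    } FpgaBlock;

    /* "Reprogramming" the block between pipeline stages: cheap, because each
     * stage runs long enough to amortize the reload. */
    static void fpga_load(FpgaBlock *b, const float *kernel, size_t taps)
    {
        b->kernel = kernel;
        b->taps = taps;
    }

    static void fpga_run(const FpgaBlock *b, const float *in, float *out, size_t n)
    {
        for (size_t i = 0; i + b->taps <= n; i++) {
            float acc = 0.0f;
            for (size_t k = 0; k < b->taps; k++)
                acc += in[i + k] * b->kernel[k];
            out[i] = acc;
        }
    }

    /* One pass of a pipeline: reconfigure, run a stage, reconfigure, run... */
    static void run_pipeline(FpgaBlock *b, float *buf, float *tmp, size_t n)
    {
        static const float dof_blur[5]    = {0.1f, 0.2f, 0.4f, 0.2f, 0.1f};
        static const float soft_shadow[3] = {0.25f, 0.5f, 0.25f};

        fpga_load(b, dof_blur, 5);      /* stage 1: DoF blur */
        fpga_run(b, buf, tmp, n);
        fpga_load(b, soft_shadow, 3);   /* stage 2: soft-shadow filter */
        fpga_run(b, tmp, buf, n);
    }

    int main(void)
    {
        float buf[64] = {0}, tmp[64] = {0};
        buf[32] = 1.0f;                 /* an impulse, to see the kernels' effect */
        FpgaBlock b;
        run_pipeline(&b, buf, tmp, 64);
        printf("%f\n", buf[28]);        /* blurred impulse, spread by both stages */
        return 0;
    }

The kernel coefficients above are made-up placeholders; the structure is what matters: one slow-clocked but reprogrammable unit serving every filter in the pipeline, instead of one fixed fast instruction per filter.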